library(ggplot2)
library(ggcorrplot)
library(dplyr)
library(car)
library(psych)
library(gridExtra)
library(vtable)
# Read data
data <- read.csv(file = "ames_housing_data.csv", header = TRUE, sep = ",")
For the first step in our exploratory data analysis, we take a closer look at our data and try to understand whether the sample we are working with is appropriate for our analysis.
For this project, our task is to eventually predict housing prices for a “typical” home in Ames, Iowa. While “typical” means something different to everyone, there are obvious cases that may be inappropriate and can be removed. Inspecting the data dictionary, a few key variables stand out.
The Zoning variable includes codes for non-residential properties (A - agriculture, C - commercial, I - industrial). As we are predicting prices for a ‘typical’ home, we should remove these. They make up about 29 observations.
The SaleCondition variable also indicates that the data includes “foreclosure” homes (Abnorml). As these homes are typically bank-owned and marked down, they do not represent “typical” house prices and can also be removed. This represents 190 observations.
All other observations pertain to residential homes and can be kept. While there may be “outliers”, such as in sale price, that we have not yet investigated, I am keeping those at this step because we do not yet know whether they are truly outliers.
In total, including the intersection of the atypical observations above, we lose 208 observations. Of the 2930 original observations, we now have 2722, so we lost under 8% of our dataset.
#Checking frequency of each zone.
table(data$Zoning)
##
## A (agr) C (all) FV I (all) RH RL RM
## 2 25 139 2 27 2273 462
table(data$SaleCondition)
##
## Abnorml AdjLand Alloca Family Normal Partial
## 190 12 24 46 2413 245
#Removing the zones and sale conditions that should not be included in the analysis.
data <- data %>% filter(Zoning %in% c("RL", "RH", "FV", "RM", "RP"))
data <- data %>% filter(SaleCondition %in% c('Normal', 'AdjLand', 'Alloca', 'Family', 'Partial'))
This step of our EDA focuses on the data itself and checks for errors. We inspect 20 independent variables at this step.
The variables chosen:
Y - SalePrice
X - nominal - SubClass, Zoning, Neighborhood, Foundation, CentralAir
X - continuous - BsmtFinType1, PoolArea, MiscVal, LotFrontage, FirstFlrSF
X - ordinal - LotShape, Utilities, ExterQual, Electrical, FireplaceQu
X - discrete - YrSold, YearBuilt, TotRmsAbvGrd, Fireplaces, GarageCars
#Combining the 20 variables with SalePrice
subdata <- data[,c("SalePrice",
"SubClass",
"Zoning",
"Neighborhood",
"Foundation",
"CentralAir",
"BsmtFinType1",
"PoolArea",
"MiscVal",
"LotFrontage",
"FirstFlrSF",
"LotShape",
"Utilities",
"ExterQual",
"Electrical",
"FireplaceQu",
"YrSold",
"YearBuilt",
"TotRmsAbvGrd",
"Fireplaces",
"GarageCars")]
print(vtable(subdata))
##
##
## Name Class Values
## ------------- -------- ----------------------------------------------------------
## SalePrice integer Num: 35000 to 755000
## SubClass integer Num: 20 to 190
## Zoning factor 'A (agr)' 'C (all)' 'FV' 'I (all)' 'RH' and more
## Neighborhood factor 'Blmngtn' 'Blueste' 'BrDale' 'BrkSide' 'ClearCr' and more
## Foundation factor 'BrkTil' 'CBlock' 'PConc' 'Slab' 'Stone' and more
## CentralAir factor 'N' 'Y'
## BsmtFinType1 factor '' 'ALQ' 'BLQ' 'GLQ' 'LwQ' and more
## PoolArea integer Num: 0 to 800
## MiscVal integer Num: 0 to 17000
## LotFrontage integer Num: 21 to 313
## FirstFlrSF integer Num: 334 to 5095
## LotShape factor 'IR1' 'IR2' 'IR3' 'Reg'
## Utilities factor 'AllPub' 'NoSeWa' 'NoSewr'
## ExterQual factor 'Ex' 'Fa' 'Gd' 'TA'
## Electrical factor '' 'FuseA' 'FuseF' 'FuseP' 'Mix' and more
## FireplaceQu factor 'Ex' 'Fa' 'Gd' 'Po' 'TA'
## YrSold integer Num: 2006 to 2010
## YearBuilt integer Num: 1872 to 2010
## TotRmsAbvGrd integer Num: 2 to 15
## Fireplaces integer Num: 0 to 4
## GarageCars integer Num: 0 to 5
Our first step is to look for NAs in our variables. The summary command gives a broad overview of our data, which also includes NAs. BsmtFinType1 has 73 NAs. These could be values that were simply not captured, or they could be houses without a basement, which the data dictionary says are coded as NA. A quick confirmation is to compare BsmtFinType1 with BsmtCond and check whether any existing finish quality intersects with a “No Basement” condition. Our investigation shows that the NAs for BsmtFinType1 appear valid.
The same logic applies to FireplaceQu where the NA’s are valid.
LotFrontage appears to have 462 missing values. As this is a continuous variable that represents the linear feet of street connected to the property, these could be errors that may require imputation.
The summary command also tells us the average value of each variable. One thing to notice is that the mean sale price is about $184,000 while the median is $165,000. As these two values are relatively close, this tells me that there are not many high-value homes skewing the mean upward.
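If we do decide to impute LotFrontage, one simple option is median imputation. A minimal sketch on a toy vector, standing in for subdata$LotFrontage (median imputation is one simple choice, not necessarily the best one for this dataset):

```r
# Toy stand-in for subdata$LotFrontage
lot_frontage <- c(60, 80, NA, 70, NA, 65)

# Replace each NA with the median of the observed values
imputed <- ifelse(is.na(lot_frontage),
                  median(lot_frontage, na.rm = TRUE),
                  lot_frontage)
imputed  # the two NAs become 67.5
```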
#Checking for N/As.
sapply(X = subdata, FUN = function(x) sum(is.na(x)))
## SalePrice SubClass Zoning Neighborhood Foundation
## 0 0 0 0 0
## CentralAir BsmtFinType1 PoolArea MiscVal LotFrontage
## 0 73 0 0 462
## FirstFlrSF LotShape Utilities ExterQual Electrical
## 0 0 0 0 0
## FireplaceQu YrSold YearBuilt TotRmsAbvGrd Fireplaces
## 1282 0 0 0 0
## GarageCars
## 1
#summary of all data
summary(subdata)
## SalePrice SubClass Zoning Neighborhood
## Min. : 35000 Min. : 20.00 A (agr): 0 NAmes : 404
## 1st Qu.:132000 1st Qu.: 20.00 C (all): 0 CollgCr: 260
## Median :165000 Median : 50.00 FV : 134 OldTown: 213
## Mean :184273 Mean : 57.35 I (all): 0 Edwards: 177
## 3rd Qu.:216959 3rd Qu.: 70.00 RH : 20 Somerst: 177
## Max. :755000 Max. :190.00 RL :2147 NridgHt: 164
## RM : 421 (Other):1327
## Foundation CentralAir BsmtFinType1 PoolArea
## BrkTil: 273 N: 155 GLQ :831 Min. : 0.000
## CBlock:1125 Y:2567 Unf :776 1st Qu.: 0.000
## PConc :1264 ALQ :389 Median : 0.000
## Slab : 47 Rec :262 Mean : 2.023
## Stone : 8 BLQ :249 3rd Qu.: 0.000
## Wood : 5 (Other):142 Max. :800.000
## NA's : 73
## MiscVal LotFrontage FirstFlrSF LotShape
## Min. : 0.00 Min. : 21.00 Min. : 334.0 IR1: 929
## 1st Qu.: 0.00 1st Qu.: 58.00 1st Qu.: 879.2 IR2: 74
## Median : 0.00 Median : 68.00 Median :1091.0 IR3: 16
## Mean : 53.33 Mean : 69.19 Mean :1164.3 Reg:1703
## 3rd Qu.: 0.00 3rd Qu.: 80.00 3rd Qu.:1393.5
## Max. :17000.00 Max. :313.00 Max. :5095.0
## NA's :462
## Utilities ExterQual Electrical FireplaceQu YrSold
## AllPub:2721 Ex: 106 : 1 Ex : 39 Min. :2006
## NoSeWa: 0 Fa: 25 FuseA: 161 Fa : 68 1st Qu.:2007
## NoSewr: 1 Gd: 956 FuseF: 44 Gd : 717 Median :2008
## TA:1635 FuseP: 6 Po : 43 Mean :2008
## Mix : 0 TA : 573 3rd Qu.:2009
## SBrkr:2510 NA's:1282 Max. :2010
##
## YearBuilt TotRmsAbvGrd Fireplaces GarageCars
## Min. :1872 Min. : 2.000 Min. :0.0000 Min. :0.000
## 1st Qu.:1955 1st Qu.: 5.000 1st Qu.:0.0000 1st Qu.:1.000
## Median :1976 Median : 6.000 Median :1.0000 Median :2.000
## Mean :1973 Mean : 6.459 Mean :0.6139 Mean :1.796
## 3rd Qu.:2002 3rd Qu.: 7.000 3rd Qu.:1.0000 3rd Qu.:2.000
## Max. :2010 Max. :15.000 Max. :4.0000 Max. :5.000
## NA's :1
table(data$BsmtFinType1,data$BsmtCond)
##
## Ex Fa Gd Po TA
## 0 0 0 0 0 0
## ALQ 0 1 6 16 0 366
## BLQ 0 1 8 6 1 233
## GLQ 0 1 1 57 1 771
## LwQ 0 0 19 7 0 116
## Rec 0 0 11 4 0 247
## Unf 0 0 39 26 2 709
table(data$FireplaceQu,data$Fireplaces)
##
## 0 1 2 3 4
## Ex 0 33 5 1 0
## Fa 0 59 8 1 0
## Gd 0 609 104 4 0
## Po 0 43 0 0 0
## TA 0 478 89 5 1
Next we run a loop to plot a histogram or a bar plot for the continuous and categorical variables. This gives us a decent view of the distribution of each variable. For the most part, there is an imbalance across most variables, but nothing stands out as an obvious error. For example, there are some extremely high-priced homes, but those are not isolated cases.
options(scipen = 999)
#Histograms and bar charts for distribution
for(i in 1:length(subdata)){
if (class(subdata[,i]) == "integer"){
print(
ggplot(subdata, aes(x=subdata[,i])) + geom_histogram(fill = "purple") + theme_classic() +
labs(x=colnames(subdata[i]),
y="Frequency",title = paste0("Frequency of"," ",colnames(subdata[i]))))
} else {
print(
ggplot(subdata, aes(x=subdata[,i])) + geom_bar(fill = "purple") + theme_classic() + theme(axis.text.x = element_text(angle = 45, hjust=1))+
labs(x=colnames(subdata[i]),
y="Frequency",title = paste0("Frequency of"," ",colnames(subdata[i]))))
}
}
## Warning: Removed 462 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing non-finite values (stat_bin).
We can also pick out a variable that stands out and check its distribution across sale price. For example, the top 5 neighborhoods (by frequency count) will most likely have an effect on the overall average. We can overlay their price distributions to see what they show.
By doing this, we see that NAmes has a distribution centered between $100,000 and $200,000, while Somerst homes tend to be more expensive (or more uniformly distributed across the higher sale prices).
Top5Neighborhood <- subdata %>%
filter(Neighborhood %in% c('NAmes', 'CollgCr', 'OldTown', 'Somerst', 'Edwards'))
ggplot(Top5Neighborhood, aes(x=SalePrice, fill = Neighborhood)) +
geom_histogram(color = "black", bins = 50) + theme_classic() +
theme(axis.text.x = element_text(angle = 45, hjust=1)) +
labs(x="Sale Price",
y="Frequency",
title = "Sale Price distribution of top 5 neighborhoods")
A boxplot also shows us where each data point lies relative to sale price and highlights potential outliers, of which there are quite a few. The distinction with these outliers is that they may be outliers in the sense that they should not be included as ‘typical’ homes, as opposed to being clerical errors.
The boxplot also lets us easily compare the ranges of each grouping. For example, we can see which neighborhood has the highest and lowest median prices. An interesting observation is that Veenker has one of the higher median prices and the fewest outliers. This potentially tells us that this neighborhood generally has high-priced homes.
for(i in 1:length(subdata)){
if (class(subdata[,i]) == "integer"){
next
} else {
print(
ggplot(subdata, aes(x=subdata[,i], y=SalePrice)) + geom_violin(fill = "red") + geom_boxplot(fill = "purple") + theme_classic() + theme(axis.text.x = element_text(angle = 45, hjust=1))+
labs(x=colnames(subdata[i]),
y="Sale Price",title = paste0("Boxplot of"," ",colnames(subdata[i]))))
}
}
A scatterplot is also created to see any bivariate and individual observations that may lie outside of ‘typical’. A review of the scatterplots leads me to think about a few things:
A ‘typical’ home can be defined relative to the entire sample, or relative to a certain variable.
For example, most homes do not have a pool. Should we be excluding the ones that do have a pool? If we look at homes without a pool, there are only a couple that are over $600,000. Should we be considering these as typical among the homes with a pool?
There are a couple of points with a first-floor square footage over 5,000, while everything else clusters between 1,000 and 2,000 square feet. It may be worthwhile to remove the homes greater than 5,000 sq ft.
The same concept applies to LotFrontage, where a couple of points exceed 300 feet while the majority lie well below that.
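If we did choose to trim these bivariate outliers, a simple filter would do it. A sketch on a toy data frame standing in for subdata; the 5,000 sq ft and 300 ft cutoffs come from the eyeballed gaps in the scatterplots, not from a formal rule:

```r
# Toy stand-in for subdata with two planted extreme rows
toy <- data.frame(FirstFlrSF  = c(1200, 1500, 5095, 900),
                  LotFrontage = c(60, 313, 70, 80))

# Keep only rows inside both eyeballed thresholds
trimmed <- subset(toy, FirstFlrSF <= 5000 & LotFrontage <= 300)
nrow(trimmed)  # 2: the 5095 sq ft home and the 313 ft frontage home are dropped
```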
for(i in 1:length(subdata)){
print(
ggplot(subdata, aes(x=subdata[,i], y=SalePrice)) + geom_point(col = "purple") + theme_classic() + theme(axis.text.x = element_text(angle = 45, hjust=1))+
labs(x=colnames(subdata[i]),
y="Sale Price",title = paste0("Scatterplot of"," ",colnames(subdata[i]))))
}
## Warning: Removed 462 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_point).
With the variables chosen, one last check I did was to compare the year sold against the year built. There is one observation where the year sold is before the year built. While it is possible that this buyer purchased the home prior to it being built, there is only one instance of this, so it could be an error; the year sold could be adjusted to equal the year built.
#Checking for clerical errors
#Sold vs built: sold should be on or after built
subdata$soldVsBuilt <- subdata$YrSold - subdata$YearBuilt
subset(subdata, soldVsBuilt < 0)
## SalePrice SubClass Zoning Neighborhood Foundation CentralAir
## 2044 183850 20 RL Edwards PConc Y
## BsmtFinType1 PoolArea MiscVal LotFrontage FirstFlrSF LotShape
## 2044 GLQ 0 17000 128 5095 IR1
## Utilities ExterQual Electrical FireplaceQu YrSold YearBuilt
## 2044 AllPub Ex SBrkr Gd 2007 2008
## TotRmsAbvGrd Fireplaces GarageCars soldVsBuilt
## 2044 15 2 3 -1
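The proposed adjustment could look like the following sketch, shown here on a toy data frame standing in for subdata:

```r
# Where YrSold precedes YearBuilt, set YrSold equal to YearBuilt, as proposed above
toy <- data.frame(YrSold = c(2007, 2009), YearBuilt = c(2008, 2006))

bad <- toy$YrSold < toy$YearBuilt
toy$YrSold[bad] <- toy$YearBuilt[bad]
toy$YrSold  # c(2008, 2009)
```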
Our data quality check can also include checking for normality. Normality is important, as many models and statistical techniques assume a normal distribution. If normality is not achieved, certain statistical tests will not be as robust, and predictive models may not perform as well. The ways to overcome this would be to use non-parametric techniques or to look into transformations that make the variables more ‘normal’.
For the most part, all of our continuous variables deviate from the normal distribution. This tells us there may be an opportunity to transform the variables to bring them closer to a normal distribution. We would also want to take this into consideration when performing statistical tests.
for(i in 1:length(subdata)){
if (class(subdata[,i]) == "integer"){
print(
ggplot(subdata, aes(sample=subdata[,i])) +
stat_qq(col = "purple") +
stat_qq_line(col = "red") + theme_classic() + theme(axis.text.x = element_text(angle = 45, hjust=1))+
labs(x="Theoretical",
y="Sample", title = paste0("Q-Q plot to test for normality of ",colnames(subdata[i]))))
} else {
next
}
}
## Warning: Removed 462 rows containing non-finite values (stat_qq).
## Warning: Removed 462 rows containing non-finite values (stat_qq_line).
## Warning: Removed 1 rows containing non-finite values (stat_qq).
## Warning: Removed 1 rows containing non-finite values (stat_qq_line).
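The visual Q-Q checks above can be complemented with a formal test such as Shapiro-Wilk, which handles up to 5,000 observations and so fits this sample. A sketch on simulated data rather than the housing columns:

```r
set.seed(42)
skewed <- rlnorm(500)   # lognormal draw: clearly non-normal
normal <- rnorm(500)    # reference draw from an actual normal

# Small p-values reject the null hypothesis of normality
shapiro.test(skewed)$p.value  # tiny: reject normality
shapiro.test(normal)$p.value  # usually large: consistent with normality
```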
For the actual initial exploratory data analysis, we pick 10 variables and focus on their relationships with each other and, in particular, with the response variable.
The 10 variables chosen are
Y - SalePrice
X - nominal - SubClass, Zoning, Neighborhood
X - continuous - BsmtFinType1, PoolArea
X - ordinal - LotShape, Utilities
X - discrete - YearBuilt, TotRmsAbvGrd, Fireplaces
subdata2 <- data[,c("SalePrice",
"SubClass",
"Zoning",
"Neighborhood",
"BsmtFinType1",
"PoolArea",
"LotShape",
"Utilities",
"YearBuilt",
"TotRmsAbvGrd",
"Fireplaces"
)]
print(vtable(subdata2))
##
##
## Name Class Values
## ------------- -------- ----------------------------------------------------------
## SalePrice integer Num: 35000 to 755000
## SubClass integer Num: 20 to 190
## Zoning factor 'A (agr)' 'C (all)' 'FV' 'I (all)' 'RH' and more
## Neighborhood factor 'Blmngtn' 'Blueste' 'BrDale' 'BrkSide' 'ClearCr' and more
## BsmtFinType1 factor '' 'ALQ' 'BLQ' 'GLQ' 'LwQ' and more
## PoolArea integer Num: 0 to 800
## LotShape factor 'IR1' 'IR2' 'IR3' 'Reg'
## Utilities factor 'AllPub' 'NoSeWa' 'NoSewr'
## YearBuilt integer Num: 1872 to 2010
## TotRmsAbvGrd integer Num: 2 to 15
## Fireplaces integer Num: 0 to 4
We split up our data into 2 sets (numeric and categorical) for easier exploration.
numeric_subdata2 <- subdata2[,c("SalePrice","SubClass", "PoolArea", "YearBuilt", "TotRmsAbvGrd", "Fireplaces")]
print(vtable(numeric_subdata2))
##
##
## Name Class Values
## ------------- -------- ---------------------
## SalePrice integer Num: 35000 to 755000
## SubClass integer Num: 20 to 190
## PoolArea integer Num: 0 to 800
## YearBuilt integer Num: 1872 to 2010
## TotRmsAbvGrd integer Num: 2 to 15
## Fireplaces integer Num: 0 to 4
cat_subdata2 <- subdata2[,c("SalePrice","Zoning","Neighborhood", "BsmtFinType1", "LotShape", "Utilities")]
print(vtable(cat_subdata2))
##
##
## Name Class Values
## ------------- -------- ----------------------------------------------------------
## SalePrice integer Num: 35000 to 755000
## Zoning factor 'A (agr)' 'C (all)' 'FV' 'I (all)' 'RH' and more
## Neighborhood factor 'Blmngtn' 'Blueste' 'BrDale' 'BrkSide' 'ClearCr' and more
## BsmtFinType1 factor '' 'ALQ' 'BLQ' 'GLQ' 'LwQ' and more
## LotShape factor 'IR1' 'IR2' 'IR3' 'Reg'
## Utilities factor 'AllPub' 'NoSeWa' 'NoSewr'
We first look at the correlation between the numeric variables. A high correlation tells us about the relationship between two variables and can also lead us to drop one of them when we build our regression model.
For correlation, we first check the assumptions that usually need to be satisfied:
1. Variables are continuous (this is confirmed)
2. No outliers, defined as values more than 3.29 standard deviations from the mean
3. Linearity: a straight-line relationship
A scatterplot between the dependent and each independent variable does show somewhat of a linear relationship, but at the tail ends of some segments the data becomes sparse. Even with limited data, you do see a relationship. Visually, the variable with the most data and the most credible pattern is YearBuilt vs price, where the newer the home, the more expensive it is.
An interesting note is to also look at the confidence interval of each relationship. While you see a relationship between PoolArea and price, the confidence band is wide and could even contain a downward slope. Compare this to YearBuilt, where the tighter band shows a very confident, direct relationship.
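Assumption 2 (no values beyond ±3.29 standard deviations from the mean) can be checked directly with z-scores. A minimal sketch on a toy vector with one planted extreme value; for the real data, the same computation would run on each column of numeric_subdata2:

```r
set.seed(1)
x <- c(rnorm(100, mean = 100, sd = 10), 500)  # one planted extreme value

# Flag observations more than 3.29 standard deviations from the mean
z <- (x - mean(x)) / sd(x)
which(abs(z) > 3.29)  # 101: only the planted value is flagged
```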
for(i in 1:length(numeric_subdata2)){
print(
ggplot(numeric_subdata2, aes(x=numeric_subdata2[,i], y=SalePrice)) + geom_point(col = "purple") + theme_classic() + geom_smooth(method='lm', fill = "red") +
theme(axis.text.x = element_text(angle = 45, hjust=1))+
labs(x=colnames(numeric_subdata2[i]),
y="Sale Price",title = paste0("Scatter plot with smoother of ",colnames(numeric_subdata2[i]))))
}
We look at the scatterplots between interactions of the independent variables and do not notice an obvious linear relationship. Most of the smoothed lines are flat (aside from the ones involving price). This is good to keep in mind as we build out our correlogram and judge its correlation values. These charts are in the appendix for separate viewing.
A quick look at the standard deviations tells us how far each variable deviates from its mean.
describe(numeric_subdata2, skew=FALSE)
## vars n mean sd min max range se
## SalePrice 1 2722 184272.82 78894.77 35000 755000 720000 1512.18
## SubClass 2 2722 57.35 42.26 20 190 170 0.81
## PoolArea 3 2722 2.02 34.00 0 800 800 0.65
## YearBuilt 4 2722 1972.68 29.88 1872 2010 138 0.57
## TotRmsAbvGrd 5 2722 6.46 1.56 2 15 13 0.03
## Fireplaces 6 2722 0.61 0.65 0 4 4 0.01
With the assumptions checked, we now look at our correlogram.
An initial look at the correlogram follows intuition. There is a positive correlation with sale price for year built, number of rooms above ground, and number of fireplaces. This makes sense: newer homes are more expensive, multi-room houses cost more, and a house with a fireplace would cost more than a house without one. This is in line with the prior scatterplots. There is a negative correlation for SubClass, which says the higher the subclass code, the lower the sale price. This also makes sense, as the data dictionary points out that the lower subclass codes are single-family homes while the higher ones are duplexes and conversions.
corr <-cor(numeric_subdata2)
ggcorrplot(corr, hc.order=TRUE,
type ="lower",
lab = TRUE,
lab_size=3,
method = "circle",
colors = c("red", "white", "purple"),
title = "Correlogram of Housing prices",
ggtheme = theme_classic)
#4B INITIAL EXPLORATORY DATA ANALYSIS - CATEGORICAL VARIABLES
For categorical variables, we perform a similar type of analysis. For a categorical-to-categorical comparison, we would use the chi-squared test to test independence. However, an initial run of chisq.test gives a warning because some cell counts are small. In this case, we use the Fisher exact test with simulated p-values.
The Fisher test shows the following pairs with a p-value > .05: Zoning/Utilities, Neighborhood/Utilities, BsmtFinType1/Utilities, and LotShape/Utilities.
For the rest of the combinations, we would reject the null hypothesis that the variables are independent of each other (i.e., they may be dependent). For the four pairs listed above, we fail to reject independence. This again follows intuition: utilities may naturally have no relationship with the type of house one would purchase.
#Run chisq.test on all categorical variable pairs
#Chi Square test
chisq.test(cat_subdata2$Zoning,cat_subdata2$Neighborhood)
## Warning in chisq.test(cat_subdata2$Zoning, cat_subdata2$Neighborhood): Chi-
## squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: cat_subdata2$Zoning and cat_subdata2$Neighborhood
## X-squared = 4190.6, df = 81, p-value < 0.00000000000000022
chisq.test(cat_subdata2$Zoning,cat_subdata2$BsmtFinType1)
## Warning in chisq.test(cat_subdata2$Zoning, cat_subdata2$BsmtFinType1): Chi-
## squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: cat_subdata2$Zoning and cat_subdata2$BsmtFinType1
## X-squared = 191.19, df = 15, p-value < 0.00000000000000022
chisq.test(cat_subdata2$Zoning,cat_subdata2$LotShape)
## Warning in chisq.test(cat_subdata2$Zoning, cat_subdata2$LotShape): Chi-
## squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: cat_subdata2$Zoning and cat_subdata2$LotShape
## X-squared = 185.02, df = 9, p-value < 0.00000000000000022
chisq.test(cat_subdata2$Zoning,cat_subdata2$Utilities)
## Warning in chisq.test(cat_subdata2$Zoning, cat_subdata2$Utilities): Chi-
## squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: cat_subdata2$Zoning and cat_subdata2$Utilities
## X-squared = 0.26791, df = 3, p-value = 0.9659
chisq.test(cat_subdata2$Neighborhood,cat_subdata2$BsmtFinType1)
## Warning in chisq.test(cat_subdata2$Neighborhood,
## cat_subdata2$BsmtFinType1): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: cat_subdata2$Neighborhood and cat_subdata2$BsmtFinType1
## X-squared = 1469, df = 135, p-value < 0.00000000000000022
chisq.test(cat_subdata2$Neighborhood,cat_subdata2$LotShape)
## Warning in chisq.test(cat_subdata2$Neighborhood, cat_subdata2$LotShape):
## Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: cat_subdata2$Neighborhood and cat_subdata2$LotShape
## X-squared = 626.8, df = 81, p-value < 0.00000000000000022
chisq.test(cat_subdata2$Neighborhood,cat_subdata2$Utilities)
## Warning in chisq.test(cat_subdata2$Neighborhood, cat_subdata2$Utilities):
## Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: cat_subdata2$Neighborhood and cat_subdata2$Utilities
## X-squared = 15.705, df = 27, p-value = 0.9583
chisq.test(cat_subdata2$BsmtFinType1,cat_subdata2$LotShape)
## Warning in chisq.test(cat_subdata2$BsmtFinType1, cat_subdata2$LotShape):
## Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: cat_subdata2$BsmtFinType1 and cat_subdata2$LotShape
## X-squared = 74.444, df = 15, p-value = 0.0000000007133
chisq.test(cat_subdata2$BsmtFinType1,cat_subdata2$Utilities)
## Warning in chisq.test(cat_subdata2$BsmtFinType1, cat_subdata2$Utilities):
## Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: cat_subdata2$BsmtFinType1 and cat_subdata2$Utilities
## X-squared = 2.4146, df = 5, p-value = 0.7893
chisq.test(cat_subdata2$LotShape,cat_subdata2$Utilities)
## Warning in chisq.test(cat_subdata2$LotShape, cat_subdata2$Utilities): Chi-
## squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: cat_subdata2$LotShape and cat_subdata2$Utilities
## X-squared = 1.9307, df = 3, p-value = 0.5869
#Fisher Test
fisher.test(cat_subdata2$Zoning,cat_subdata2$Neighborhood, simulate.p.value = TRUE)
##
## Fisher's Exact Test for Count Data with simulated p-value (based
## on 2000 replicates)
##
## data: cat_subdata2$Zoning and cat_subdata2$Neighborhood
## p-value = 0.0004998
## alternative hypothesis: two.sided
fisher.test(cat_subdata2$Zoning,cat_subdata2$BsmtFinType1,simulate.p.value = TRUE)
##
## Fisher's Exact Test for Count Data with simulated p-value (based
## on 2000 replicates)
##
## data: cat_subdata2$Zoning and cat_subdata2$BsmtFinType1
## p-value = 0.0004998
## alternative hypothesis: two.sided
fisher.test(cat_subdata2$Zoning,cat_subdata2$LotShape,simulate.p.value = TRUE)
##
## Fisher's Exact Test for Count Data with simulated p-value (based
## on 2000 replicates)
##
## data: cat_subdata2$Zoning and cat_subdata2$LotShape
## p-value = 0.0004998
## alternative hypothesis: two.sided
fisher.test(cat_subdata2$Zoning,cat_subdata2$Utilities,simulate.p.value = TRUE)
##
## Fisher's Exact Test for Count Data with simulated p-value (based
## on 2000 replicates)
##
## data: cat_subdata2$Zoning and cat_subdata2$Utilities
## p-value = 1
## alternative hypothesis: two.sided
fisher.test(cat_subdata2$Neighborhood,cat_subdata2$BsmtFinType1,simulate.p.value = TRUE)
##
## Fisher's Exact Test for Count Data with simulated p-value (based
## on 2000 replicates)
##
## data: cat_subdata2$Neighborhood and cat_subdata2$BsmtFinType1
## p-value = 0.0004998
## alternative hypothesis: two.sided
fisher.test(cat_subdata2$Neighborhood,cat_subdata2$LotShape,simulate.p.value = TRUE)
##
## Fisher's Exact Test for Count Data with simulated p-value (based
## on 2000 replicates)
##
## data: cat_subdata2$Neighborhood and cat_subdata2$LotShape
## p-value = 0.0004998
## alternative hypothesis: two.sided
fisher.test(cat_subdata2$Neighborhood,cat_subdata2$Utilities,simulate.p.value = TRUE)
##
## Fisher's Exact Test for Count Data with simulated p-value (based
## on 2000 replicates)
##
## data: cat_subdata2$Neighborhood and cat_subdata2$Utilities
## p-value = 0.4803
## alternative hypothesis: two.sided
fisher.test(cat_subdata2$BsmtFinType1,cat_subdata2$LotShape,simulate.p.value = TRUE)
##
## Fisher's Exact Test for Count Data with simulated p-value (based
## on 2000 replicates)
##
## data: cat_subdata2$BsmtFinType1 and cat_subdata2$LotShape
## p-value = 0.0004998
## alternative hypothesis: two.sided
fisher.test(cat_subdata2$BsmtFinType1,cat_subdata2$Utilities,simulate.p.value = TRUE)
##
## Fisher's Exact Test for Count Data with simulated p-value (based
## on 2000 replicates)
##
## data: cat_subdata2$BsmtFinType1 and cat_subdata2$Utilities
## p-value = 0.6892
## alternative hypothesis: two.sided
fisher.test(cat_subdata2$LotShape,cat_subdata2$Utilities,simulate.p.value = TRUE)
##
## Fisher's Exact Test for Count Data with simulated p-value (based
## on 2000 replicates)
##
## data: cat_subdata2$LotShape and cat_subdata2$Utilities
## p-value = 0.3678
## alternative hypothesis: two.sided
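The pairwise tests above can be condensed into a loop over all combinations. A sketch on a toy data frame; the same pattern would apply to cat_subdata2 with its five factors:

```r
set.seed(1)
# Toy stand-in with three categorical columns
toy <- data.frame(A = factor(sample(c("x", "y"), 200, replace = TRUE)),
                  B = factor(sample(c("p", "q"), 200, replace = TRUE)),
                  C = factor(sample(c("u", "v"), 200, replace = TRUE)))

# One Fisher test per pair of columns
pairs <- combn(names(toy), 2)
pvals <- apply(pairs, 2, function(pr)
  fisher.test(toy[[pr[1]]], toy[[pr[2]]], simulate.p.value = TRUE)$p.value)
names(pvals) <- paste(pairs[1, ], pairs[2, ], sep = " vs ")
pvals  # three p-values: A vs B, A vs C, B vs C
```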
Our third test of association is between the categorical variables and the dependent variable (sale price).
For this test, we use analysis of variance (ANOVA). However, we first want to check its assumptions.
ANOVA assumptions:
1. Common variance
For common variance, we can run the ANOVA and visually inspect the residuals vs fitted plots, or we can run Levene’s test. Here, we do both.
The residuals vs fitted plots show that the spread does not stay constant as you move across groupings. See Neighborhood as an example: moving right along the x-axis, the points become more spread out. Potentially, the only variable satisfying the assumption is Utilities. Levene’s test confirms this, as Utilities is the only one for which we would not reject the null hypothesis of equal variances.
This is consistent with the prior analysis above.
Anova_Neighborhood_Price <-cat_subdata2[,c("SalePrice","Neighborhood")]
Anova_Neighborhood_Price.aov <- aov(SalePrice~Neighborhood, data=Anova_Neighborhood_Price)
Anova_Zone_Price <-cat_subdata2[,c("SalePrice","Zoning")]
Anova_Zone_Price.aov <- aov(SalePrice~Zoning, data=Anova_Zone_Price)
Anova_BsmtFinType1_Price <-cat_subdata2[,c("SalePrice","BsmtFinType1")]
Anova_BsmtFinType1_Price.aov <- aov(SalePrice~BsmtFinType1, data=Anova_BsmtFinType1_Price)
Anova_LotShape_Price <-cat_subdata2[,c("SalePrice","LotShape")]
Anova_LotShape_Price.aov <- aov(SalePrice~LotShape, data=Anova_LotShape_Price)
Anova_Utilities_Price <-cat_subdata2[,c("SalePrice","Utilities")]
Anova_Utilities_Price.aov <- aov(SalePrice~Utilities, data=Anova_Utilities_Price)
plot(Anova_Neighborhood_Price.aov,1)
plot(Anova_Zone_Price.aov,1)
plot(Anova_BsmtFinType1_Price.aov,1)
plot(Anova_Utilities_Price.aov,1)
plot(Anova_LotShape_Price.aov,1)
leveneTest(SalePrice~Neighborhood, data=cat_subdata2)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 27 24.522 < 0.00000000000000022 ***
## 2694
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
leveneTest(SalePrice~Zoning, data=cat_subdata2)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 3 30.306 < 0.00000000000000022 ***
## 2718
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
leveneTest(SalePrice~BsmtFinType1, data=cat_subdata2)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 5 44.641 < 0.00000000000000022 ***
## 2643
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
leveneTest(SalePrice~LotShape, data=cat_subdata2)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 3 9.3925 0.000003566 ***
## 2718
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
leveneTest(SalePrice~Utilities, data=cat_subdata2)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 0.9 0.3429
## 2720
The 3 variables chosen are:
Y - SalePrice
X - nominal - SubClass
X - ordinal - LotShape
X - discrete - YrSold
subdata3 <- data[,c("SubClass","LotShape","YrSold", "SalePrice")]
subdata3$Log_SalePrice <-log(subdata3$SalePrice)
print(vtable(subdata3))
##
##
## Name Class Values
## -------------- -------- ------------------------
## SubClass integer Num: 20 to 190
## LotShape factor 'IR1' 'IR2' 'IR3' 'Reg'
## YrSold integer Num: 2006 to 2010
## SalePrice integer Num: 35000 to 755000
## Log_SalePrice numeric Num: 10.463 to 13.534
As mentioned in the prior steps, we know that SalePrice is not normally distributed. By doing a log transform, we see a more “normal” distribution, as the data points in the upper tail get pulled in.
p1<-ggplot(subdata3, aes(sample=SalePrice)) + stat_qq(col = "purple") + stat_qq_line(col = "red") + theme_classic() + theme(axis.text.x = element_text(angle = 45, hjust=1))+
labs(x="Theoretical",
y="Sample", title = "QQ Plot Sale Price")
p2<-ggplot(subdata3, aes(sample=Log_SalePrice)) + stat_qq(col = "purple") + stat_qq_line(col = "red") + theme_classic() + theme(axis.text.x = element_text(angle = 45, hjust=1))+
labs(x="Theoretical",
y="Sample", title = "QQ Plot Log Sale")
grid.arrange(p1,p2,nrow=1)
We now inspect the scatterplot and boxplot distributions and compare them against sale price and log sale price. A takeaway is that on the log price scale, the outliers are pulled in. This is beneficial, as extreme values exert less influence on our data.
p1<- ggplot(subdata3, aes(x=subdata3[,"SubClass"], y=SalePrice)) + geom_point(color = "purple") + theme_classic() + theme(axis.text.x = element_text(angle = 45, hjust=1))+
labs(x="SubClass",
y="Price", title = "SubClass vs Price")
p2<- ggplot(subdata3, aes(x=subdata3[,"SubClass"], y=Log_SalePrice)) + geom_point(color = "purple") + theme_classic() + theme(axis.text.x = element_text(angle = 45, hjust=1))+
labs(x="SubClass",
y="LogPrice", title = "SubClass vs LogPrice")
grid.arrange(p1,p2,nrow=1)
p1<- ggplot(subdata3, aes(x=LotShape, y=SalePrice)) + geom_boxplot(fill = "purple")+ theme_classic() + theme(axis.text.x = element_text(angle = 45, hjust=1))+
labs(x="LotShape",
y="Price", title = "LotShape vs Price")
p2<- ggplot(subdata3, aes(x=LotShape, y=Log_SalePrice)) + geom_boxplot(fill = "purple")+ theme_classic() + theme(axis.text.x = element_text(angle = 45, hjust=1))+
labs(x="LotShape",
y="LogPrice", title = "LotShape vs LogPrice")
grid.arrange(p1,p2,nrow=1)
p1<- ggplot(subdata3, aes(x=YrSold, y=SalePrice)) + geom_point(color = "purple") + theme_classic() + theme(axis.text.x = element_text(angle = 45, hjust=1))+
labs(x="YrSold",
y="Price", title = "YrSold vs Price")
p2<- ggplot(subdata3, aes(x=YrSold, y=Log_SalePrice)) + geom_point(color = "purple") + theme_classic() + theme(axis.text.x = element_text(angle = 45, hjust=1))+
labs(x="YrSold",
y="LogPrice", title = "YrSold vs LogPrice")
grid.arrange(p1,p2,nrow=1)
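The "outliers being brought down" observation can also be counted rather than eyeballed, using the same 1.5 * IQR rule that boxplots use for flagging points. This is a sketch on simulated lognormal prices, not the actual Ames columns; the same helper could be applied to `subdata3$SalePrice`.

```r
# Count boxplot-style outliers: points beyond 1.5 * IQR from the quartiles.
n_outliers <- function(x) {
  q   <- quantile(x, c(0.25, 0.75))
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)
}

set.seed(1)
sim_price <- exp(rnorm(2000, mean = 12, sd = 0.4))  # simulated right-skewed prices

n_outliers(sim_price)       # many right-tail points flagged on the raw scale
n_outliers(log(sim_price))  # far fewer flagged after the log transform
```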
All in all, the biggest lift from transformation is on the dependent variable: comparing log price with price, normality appears to improve dramatically. For the predictors, the improvement in outliers is less dramatic, but may still be beneficial in the modeling steps.
This exercise shows us a couple of points:
There are inherent difficulties in working with the original data. Without any adjustments, since the SalePrice distribution is not normal, statistical tests and parametric techniques may not work or may give inaccurate results. This requires us to transform the data, which may give better results, but we need to make sure the interpretation remains accurate.
Transforming the dependent variable improved the quality of the data, and it may be worthwhile to transform the predictors as well. However, the three chosen here are not continuous variables, so transforming them may or may not add value. It may be worth going back and choosing other variables that lend themselves to transformation.
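One quick way to judge whether a continuous predictor would benefit from a log transform is to compare its linear correlation with log price before and after transforming. The sketch below uses a simulated right-skewed predictor (a hypothetical stand-in for something like LotArea), not the actual Ames data.

```r
# Simulate a predictor that is linear in log price on the log scale,
# so its raw (exponentiated) version has a distorting right tail.
set.seed(7)
log_area  <- rnorm(1500, mean = 9, sd = 1)
area      <- exp(log_area)                              # right-skewed raw predictor
log_price <- 3 + 0.8 * log_area + rnorm(1500, 0, 0.3)   # simulated log sale price

cor_raw <- cor(area, log_price)       # attenuated by the skewed tail
cor_log <- cor(log(area), log_price)  # stronger once the scale is linear
c(raw = cor_raw, log = cor_log)
```

The same comparison against `subdata3$Log_SalePrice` would show which real predictors gain from the transform.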
The original ask of this exercise is to predict housing prices. While a simple task on the surface, the data provided needed to be reviewed and cleaned. We ran into observations that were not 'typical' of an Ames, Iowa home: non-residential zoning such as commercial-grade zones, and sale conditions such as foreclosures that do not reflect "typical" housing prices.
Once these were removed, we dug into the data itself to look for clerical errors and logical inconsistencies. In this portion, we looked at missing values, "typos", and other concerns with the data. This process forced us to decide what to do when we spot such concerns. While removing more observations is an option, we risk losing too much data. The alternative is to impute missing values, or to make a reasonable assumption about what the value should have been.
Next, we started our exploratory data analysis. This included looking at the relationship between our independent and dependent variables. While histograms showed us the overall distributions, boxplots and scatter plots made outliers easier to see. Our takeaway at this step is that most variables are imbalanced, and there are outliers with respect to price. Outliers are also present at the predictor level.
We also got to see the frequency distribution of each variable. For example, we now know that most homes sell for right around the $200,000 mark, which neighborhood is the most prominent, that most homes do not have a pool, and that most have between 1,000 and 2,000 square feet.
One interesting takeaway is that the distribution of year sold is fairly homogeneous, which tells us sales were spread roughly evenly across the years in the data rather than clustering in particular years.
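That impression of homogeneity could be checked formally with a chi-square goodness-of-fit test against a uniform spread of sales across years. The sketch below uses simulated sale years (a stand-in for `data$YrSold`), so the counts are hypothetical.

```r
# Chi-square goodness-of-fit: are sales spread evenly across 2006-2010?
set.seed(3)
yr_sold <- sample(2006:2010, 2500, replace = TRUE)  # simulated stand-in for data$YrSold

tab <- table(yr_sold)
uniformity <- chisq.test(tab)  # default null: equal expected counts in every year
uniformity$p.value             # a small p-value would indicate sales cluster in some years
```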
Further into the EDA, looking at the relationships between the variables, we ran multiple correlations and independence tests. Our correlation results told us that there is a relationship between sale price and the various predictors, but much less so between one predictor and another.
Finally, the last portion of our EDA was to consider log transformation. As sale price is not normally distributed, we determined that a log transform normalizes the data and will make our model more valid.
In summary, the data at hand shows multiple relationships between price and the variables chosen, so there should be some strength in prediction. The challenge is to prep the data beforehand to ensure it stays within the statistical assumptions.
Predictor-Predictor relationship
# Scatter each numeric predictor against a set of response columns.
# Fixes vs. the earlier drafts: the x-axis label now comes from
# numeric_subdata2 (not the unrelated subdata frame), and the duplicated
# geom_smooth(method='lm') call is removed.
response_vars <- c("SubClass", "PoolArea", "YearBuilt", "TotRmsAbvGrd", "Fireplaces")
for (yvar in response_vars) {
  for (i in 1:ncol(numeric_subdata2)) {
    print(
      ggplot(numeric_subdata2, aes(x = numeric_subdata2[, i], y = .data[[yvar]])) +
        geom_point(col = "purple") + theme_classic() +
        geom_smooth(method = "lm", fill = "red") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
        labs(x = colnames(numeric_subdata2)[i], y = yvar)
    )
  }
}
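As a compact complement to these pairwise scatter plots, the same predictor-predictor relationships can be summarized in a single correlation matrix (which `ggcorrplot`, already loaded above, can render as a heat map). The sketch below builds a small simulated frame standing in for `numeric_subdata2`; the column names are borrowed from the plots above but the values are hypothetical.

```r
# One-shot numeric summary of predictor-predictor relationships.
set.seed(11)
numeric_demo <- data.frame(  # simulated stand-in for numeric_subdata2
  YearBuilt    = sample(1900:2010, 300, replace = TRUE),
  TotRmsAbvGrd = sample(3:12,     300, replace = TRUE),
  Fireplaces   = sample(0:3,      300, replace = TRUE)
)

cor_mat <- round(cor(numeric_demo), 2)  # pairwise Pearson correlations
cor_mat
# ggcorrplot::ggcorrplot(cor_mat) would render the same matrix as a heat map
```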